9 Deep Learning Papers That You Must Know — Part 1
AlexNet — The paper that changed how we perform deep learning
ILSVRC, a.k.a. the ImageNet Large Scale Visual Recognition Challenge, is a competition where research teams evaluate their algorithms on a huge dataset of labelled images (ImageNet) and compete to achieve higher accuracy on several visual recognition tasks.
This competition has run every year since 2010. AlexNet is the name of the convolutional neural network that won the 2012 edition. It was designed by Alex Krizhevsky, Ilya Sutskever and Krizhevsky’s PhD advisor Geoffrey Hinton. Hinton, co-winner of this year’s $1M Turing Award, was initially skeptical of his student’s idea.
The popularity of this paper can be gauged from its citation count alone.
Top 7 cool things about this paper
- Depth — Layers : 8 (5 Convolutional + 3 Fully Connected), Parameters : 60 Million, Neurons : 650,000
- Activation Function — Non-Linearity Used : ReLU instead of TanH
- Speed — GPUs : 2, Training Time : 6 days
- Contrast — Response Normalization
- Overfitting prevention — Data augmentation + Dropout instead of conventional regularisation
- DATASET (ImageNet) — 1.2 million training images, 50,000 validation images, and 150,000 testing images.
- Huge winning margin — Top-5 test error rate of 15.3% vs 26.2% (second place)
Let’s understand these cool things in detail. The paper employed a number of techniques that were unusual compared to the state of the art at the time. Let’s take a look at some of the differentiating features of this paper.
Layers — Depth is ‘uber’ important
The architecture contains eight layers with weights. The first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Softmax takes all the 1000 values and converts them into a probability distribution: every output lies between 0 and 1, all outputs sum to 1, and larger inputs receive larger probabilities.
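As an illustration, here is a minimal NumPy sketch of what the 1000-way softmax computes (shown with 3 scores instead of 1000 for brevity):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # the largest score gets the largest probability
print(probs.sum())  # the probabilities sum to 1
```

Note that the class with the largest score keeps the largest probability, but the others are not set to zero; the whole distribution is used when computing the training loss.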
Connections between layers — Notice how right after the 1st layer we have two parallel paths that are structurally identical. Each path processes part of the data in parallel, one path per GPU. Yet some computations are shared: the second, fourth, and fifth convolutional layers take input only from their own path, while the third layer is cross-connected to both paths of the second layer.
This cross-connection scheme was a neat trick which reduced their top-1 and top-5 error rates by 1.7% and 1.2%, respectively. This was huge given that they were already ahead of the state of the art.
The depth of this network (the number of layers) is so critical that removing any of the middle layers noticeably degrades the accuracy.
Non-Linearity
AlexNet is a Convolutional Neural Network, and a neural network is made up of neurons. Biologically inspired neural networks possess something called an activation function. In simple terms, the activation function decides whether the input stimulus is enough for a neuron to fire — i.e. get activated.
The AlexNet team chose a non-linear activation function, the non-linearity being the Rectified Linear Unit (ReLU). They claimed that it trained much faster than TanH, the more popular choice of non-linearity at the time.
Why do we need a non-linear activation function in an artificial neural network?
Neural networks are used to implement complex functions, and non-linear activation functions enable them to approximate arbitrarily complex functions. Without the non-linearity introduced by the activation function, multiple layers of a neural network are equivalent to a single layer neural network.
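A tiny NumPy experiment makes this concrete: two stacked linear layers with no activation in between collapse into a single linear layer, while inserting a ReLU breaks the equivalence. The weights here are random illustrative values, not AlexNet’s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between...
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x)

# ...are exactly equivalent to a single layer with weights W2 @ W1.
single_layer = (W2 @ W1) @ x
print(np.allclose(two_layer, single_layer))  # True

# Inserting a ReLU between the layers breaks this equivalence,
# which is what lets the network represent non-linear functions.
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x)
```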
The authors showed that ReLU achieved a low error rate much faster than TanH. The popular plot is shown below with the thick line representing ReLU and dashed line, the TanH function.
The keyword being faster. Above all, AlexNet needed a faster training time, and ReLU helped them. But they needed something more: something that could transform the speed with which CNNs were computed. This is where GPUs came into the picture.
GPUs and Training Time
GPUs are devices that can perform many computations in parallel. Remember how an average laptop CPU is either quad-core (4 cores) or octa-core (8 cores); this refers to the number of computations that can happen in parallel. A GPU can have thousands of cores, allowing massive parallelization. AlexNet made use of GPUs that NVIDIA had launched a year before AlexNet came out.
The noticeable thing was that AlexNet made use of 2 GPUs in parallel which made their design extremely fast.
Even with this setup, AlexNet took around six days to train. But training time was not the only concern. Accuracy suffers when neuron responses are not normalized, so AlexNet needed an efficient way of normalizing them. They chose LRN.
Local Response Normalization
In neurobiology, there is a concept called “lateral inhibition”. This refers to the capacity of an excited neuron to subdue its neighbors. The neuron does this to increase the contrast in its surroundings, thereby sharpening the sensory perception in that particular area. Local response normalization (LRN) is the computer science way of achieving the same thing.
AlexNet employed LRN to aid generalization. Response normalization reduced their top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
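For intuition, here is a NumPy sketch of LRN across channels, following the formula and constants reported in the paper (k = 2, n = 5, α = 1e-4, β = 0.75). This is an illustrative re-implementation, not the authors’ code:

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """LRN across channels, as in the AlexNet paper.

    a: activations with shape (channels, height, width).
    Each activation is divided by a term that grows with the squared
    activity of its n neighbouring channels — "lateral inhibition".
    """
    C = a.shape[0]
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C - 1, i + n // 2)
        denom = (k + alpha * (a[lo:hi + 1] ** 2).sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b

acts = np.random.default_rng(1).normal(size=(8, 4, 4))
normed = local_response_norm(acts)
print(normed.shape)  # same shape as the input: (8, 4, 4)
```

Channels with highly active neighbours get damped the most, which is exactly the contrast-boosting effect described above.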
Every CNN has pooling as an essential step. Up until 2012 most pooling schemes involved non-overlapping pools of pixels. AlexNet was ready to experiment with this part of the process.
Overlapping Pooling
Pooling is the process of picking a neighborhood of s x s pixels and summarizing it.
Summarizing can be
- A maximum of all pixel values (max pooling, the scheme AlexNet used),
- A simple average of all pixel values,
- A majority vote, or even
- A median across the patch of s x s pixels.
Traditionally, these patches were non-overlapping, i.e. once an s x s patch is summarized you don’t touch those pixels again and move on to the next s x s patch. The AlexNet team instead used a stride smaller than the patch size (3 x 3 pools with a stride of 2), so neighbouring pools overlap. They found that overlapping pooling reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme.
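A minimal NumPy sketch of max pooling makes the overlap visible: with a 3 x 3 window and a stride of 2 (AlexNet’s setting), adjacent windows share a row or column of pixels. This is an illustrative implementation, not the original code:

```python
import numpy as np

def max_pool2d(x, size=3, stride=2):
    """Max pooling; stride < size gives the overlapping scheme AlexNet used."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()  # summarize the patch by its maximum
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
print(max_pool2d(x, size=3, stride=2))  # [[12. 14.] [22. 24.]]
```

Setting stride equal to size recovers the traditional non-overlapping scheme.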
Having tackled normalization and pooling AlexNet was faced with a huge overfitting challenge. Their 60-million parameter model was bound to overfit. They needed to come up with an overfitting prevention strategy that could work at this scale.
Overfitting Prevention
Whenever a system has a huge number of parameters, it becomes prone to overfitting. Overfitting is when a model adapts itself so religiously to the training data that it fails horribly on test data. That is the equivalent of memorizing all the answers in your maths book without understanding the formulae behind them.
Given a question you’ve already seen, you can answer perfectly, but you’ll perform poorly on unseen questions.
With an architecture containing 60 million parameters AlexNet faced a considerable amount of overfitting.
They employed two methods to battle overfitting
- Data Augmentation
- Dropout
Data Augmentation
Data augmentation means increasing the size of your dataset by creating transforms of each image in it. These transforms can be simple rescalings, reflections or rotations.
These schemes led to an error reduction of over 1% in their top-1 error metric. By augmenting the data you not only increase the dataset; you also nudge the model towards becoming rotation invariant, color invariant etc., which prevents overfitting.
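As a sketch of the idea, the following NumPy snippet applies the two cheap training-time transforms described in the paper: a random 224 x 224 crop from a 256 x 256 image and a random horizontal reflection. It is an illustrative implementation, not the authors’ pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=224):
    """Random crop plus random horizontal flip (illustrative sketch)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)    # pick a random crop position
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # reflect half of the samples
        patch = patch[:, ::-1]
    return patch

image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
sample = augment(image)
print(sample.shape)  # (224, 224, 3)
```

Because the crops are sampled on the fly, each epoch effectively sees a slightly different version of every image at almost no storage cost.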
Dropout
The second technique that AlexNet used to avoid overfitting was dropout. It consists of setting the output of each hidden neuron to zero with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture.
Sampling a new architecture every time is akin to using multiple architectures without expending additional resources. The model is therefore forced to learn more robust features.
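The following NumPy sketch shows the mechanic: at training time each hidden unit is zeroed with probability 0.5, and at test time all units are used with their outputs halved, as the paper describes. This is illustrative code, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Dropout in the AlexNet paper's formulation: zero each unit with
    probability p during training; scale outputs by (1 - p) at test time."""
    if train:
        mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
        return x * mask
    return x * (1.0 - p)                 # test time: all units, halved outputs

h = rng.normal(size=1000)
dropped = dropout(h, train=True)
print((dropped == 0).mean())  # roughly half of the units are zeroed
```

Modern frameworks usually implement the equivalent “inverted” dropout (scaling by 1/(1-p) during training instead), but the effect is the same.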
Dataset
Finally, how can we show the magnificence of AlexNet without showing the challenge it faced? ImageNet contains in totality 15 million labeled high-resolution images in over 22,000 categories. ILSVRC, the competition, uses a subset of ImageNet with roughly 1,000 images in each of 1,000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.
The 2012 Challenge
AlexNet won the ILSVRC. This was a major breakthrough. Let’s look at what the ask was and what was delivered.
AlexNet also released how feature extraction looked after each layer. This data is available in their supplementary material.
Prof. Hinton, who won the Turing Award this year, was apparently not convinced by Alex’s proposed solution at first. The success of AlexNet goes to show that with enough grit and determination, innovation does find its way to success.
For a deeper dive into some of the topics mentioned above I have listed various resources I found extremely helpful.
Deep Dive
- Architecture source article
- ReLU — a very nice blog
- Activation Function Comparison — I recommend a neat little blog written by Kevin Urban
- Debugging Neural networks
- Data Augmentation
- A table laying out the details of this architecture. The size of the network can be estimated from the fact that it has 62.3 million parameters and needs 1.1 billion computation units in a forward pass.
X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it. We publish an article on such simplified AI concepts every Friday. If you liked this or have some feedback or follow-up questions please comment below.
Thanks for Reading!